Data Visualization Lab - Sebastiano Cassol - id: 229318¶

04 - 09 - 2023

Exercise 1¶

In [1]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex1.png')
plt.imshow(img)
plt.show()

A chart is essentially made of visual encoding and chart apparatus. Chart apparatus refers to the features that come with the type of chart that has been chosen, while the visual encoding elements are customized. In this case, the chart is a bubble plot showing the changes in spending relative to a specific year (2019) and a specific week (the one ending April 1).

First, we need to distinguish, among the visual encoding elements, between marks and annotations (or attributes):

  • Marks are the way the data is reported in the chart, e.g. dots, lines, areas and so on. Marks are also related to the dimensions we deal with.
  • Annotations can be quantity, size, colour and any other attribute that can represent a quantity or a relation. The main goal when representing data is to find the blend of marks and annotations that most effectively portrays the angle of analysis you wish to show.

In this case, the marks are dots representing:

  • the category/type of expense: every dot represents a specific type of expense (e.g. supermarket, video streaming, fast food and so on);
  • the variation in spending, with the quantity encoded in the size of the dots (or bubbles!). Note that the size can be determined by either the area or the diameter.

As mentioned before, annotations can be quantitative or categorical attributes. A quantitative attribute present in this graph is the position of the dots with respect to the centre of the chart, which conveys the percentage variation in spending with respect to the reference measure. There is also a categorical attribute, the colour of the dots, which indicates whether a given expense is higher or lower than the one it is compared with.

Finally, we can notice that there is no legend in this graph to help us, but there are text annotations that identify certain categories of expense.
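As an illustration, the double encoding described above (size for magnitude, colour for direction) can be sketched with made-up data; the categories and percentages below are hypothetical, chosen only to mimic the bubble plot:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical spending changes (%) for a few categories
categories = ["supermarket", "video streaming", "fast food", "airlines"]
change = np.array([25.0, 100.0, -40.0, -90.0])

# Size encodes the magnitude (here via marker area), colour encodes the sign
sizes = np.abs(change) * 20                       # marker area in points^2
colors = np.where(change >= 0, "tab:orange", "tab:blue")

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(categories, change, s=sizes, c=colors, alpha=0.7)
ax.axhline(0, color="grey", lw=1)                 # reference line: no change
ax.set_ylabel("Change in spending (%)")
plt.show()
```

Note that `s=` in `Axes.scatter` is the marker *area*; scaling the diameter instead would require squaring, which is exactly the area-vs-diameter ambiguity mentioned above.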

Exercise 2¶

There are three types of visual experience:

  • explanatory
  • exhibitory
  • exploratory

Explanatory¶

An explanatory visual experience aims to offer a detailed and comprehensive understanding of a topic through visuals. It therefore takes on the responsibility of bringing key insights to the foreground rather than leaving that responsibility to the viewer. The main goal is a self-explanatory chart: the reader should not need an in-person explanation. This type of visual experience typically involves illustrations, diagrams or animations to guide the viewer. Annotations play an important role here, since colours, captions and labels assist the viewer in the interpretation.

Exhibitory¶

An exhibitory visual experience focuses on presenting information/data in a clear and visually pleasing way. In this case, no exploration, interaction or explanation is involved. It typically relies on charts, graphs, diagrams and infographics. The viewer has to interpret the meaning on their own and needs to know the context and the content, since this kind of visual experience is often aimed at a very specific audience or supports written articles/reports.

Exploratory¶

An exploratory visual experience allows users to interact with and explore the data. Here the viewer still needs to find their own insights, but they are assisted by an interactive visualization. Typically, this type of visual experience lets the user highlight or filter by categories of interest, change data parameters, switch views, get annotations by hovering over components, and so on. A nice way to explore data in depth.

Here is an example for each visual experience explained above:

  • Here is an explanatory visual experience that shows some information about the New York City skyline, highlighting important concepts with annotations.
In [2]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_1.png')
plt.imshow(img)
plt.show()
  • Here is an exploratory visual experience that lets the user reorder and interact with data about diversity in tech.
In [3]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_2.png')
plt.imshow(img)
plt.show()
  • Here is another exploratory visual experience that lets the user reorder and interact with data about NFL salaries.
In [4]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_3.png')
plt.imshow(img)
plt.show()
  • Finally, here is an exhibitory visual experience that simply represents the global temperature change through the years.
In [5]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_4.png')
plt.imshow(img)
plt.show()

Exercise 3¶

UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are both dimensionality reduction techniques. To explain the sentence, it is necessary to go step by step. The main goal of these two techniques is to reduce the dimension of high-dimensional data while preserving structure and relationships. To do so, UMAP transforms the original high-dimensional space into a lower-dimensional space: it tries to find a lower-dimensional representation of the data that still captures the relationships previously present between data points. Thus, the sentence says UMAP "induces a space transformation" because UMAP constructs a low-dimensional representation of the space itself, while t-SNE maps the data points from the high-dimensional space to a lower-dimensional one, preserving the similarities between points, without explicitly constructing a lower-dimensional transformation.
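This difference is also visible in the scikit-learn API: `PCA` (and, similarly, umap-learn's `UMAP`) learns a reusable mapping that can be applied to new points via `transform`, while `TSNE` only offers `fit_transform`, i.e. it embeds the points it was fitted on. A minimal check on synthetic data (the data is arbitrary; only the API matters here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(100, 16)

pca = PCA(n_components=2).fit(X)
print(hasattr(pca, "transform"))    # PCA exposes a reusable space transformation

tsne = TSNE(n_components=2, perplexity=5)
embedding = tsne.fit_transform(X)   # shape (100, 2)
print(hasattr(tsne, "transform"))   # t-SNE only embeds the fitted points
```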

Here are some comparisons between UMAP and t-SNE (in biology).

In [6]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex3.png')
plt.imshow(img)
plt.show()

Both perform well, but UMAP seems to accentuate and highlight clusters and relationships better than t-SNE. In any case, a more in-depth investigation would surely be needed to properly show the differences between these two techniques.

Exercise 4¶

In [7]:
# imports
import pandas as pd
In [8]:
# read csv
employment = pd.read_csv('datasets/employment_italy.csv')
In [9]:
# visualize data
employment.head()
Out[9]:
ITTER107 Territorio TIPO_DATO_FOL Tipo dato SEXISTAT1 Sesso ETA1 Classe di età TITOLO_STUDIO Titolo di studio CITTADINANZA Cittadinanza TIME Seleziona periodo Value Flag Codes Flags
0 ITC1 Piemonte EMP_R tasso di occupazione 1 maschi Y15-24 15-24 anni 99 totale TOTAL totale 2018 2018 25.101074 NaN NaN
1 ITC1 Piemonte EMP_R tasso di occupazione 1 maschi Y15-24 15-24 anni 99 totale TOTAL totale 2019 2019 23.817206 NaN NaN
2 ITC1 Piemonte EMP_R tasso di occupazione 1 maschi Y15-24 15-24 anni 99 totale TOTAL totale 2020 2020 24.314688 NaN NaN
3 ITC1 Piemonte EMP_R tasso di occupazione 1 maschi Y15-24 15-24 anni 99 totale TOTAL totale 2021 2021 25.062330 NaN NaN
4 ITC1 Piemonte EMP_R tasso di occupazione 1 maschi Y15-24 15-24 anni 99 totale TOTAL totale 2022 2022 23.491588 NaN NaN
In [10]:
# normalize region names to match the geojson region names;
# Series.replace avoids the chained assignment that triggers SettingWithCopyWarning
replacements = {
    'Trentino Alto Adige / Südtirol': 'Trentino-Alto Adige/Südtirol',
    'Provincia Autonoma Bolzano / Bozen': 'Trentino-Alto Adige/Südtirol',
    'Provincia Autonoma Trento': 'Trentino-Alto Adige/Südtirol',
    "Valle d'Aosta / Vallée d'Aoste": "Valle d'Aosta/Vallée d'Aoste",
}
employment['Territorio'] = employment['Territorio'].replace(replacements)
In [11]:
# prepare data for bar plot
employment_female = employment[employment['Sesso'] == 'femmine']

employment_female_20_64 = employment_female[employment_female['ETA1'] == 'Y20-64']

employment_graduated_female = employment_female_20_64[employment_female_20_64['Titolo di studio'] == 'laurea e post-laurea']

regions = employment_graduated_female['Territorio'].unique().tolist()

employment_graduated_female
Out[11]:
ITTER107 Territorio TIPO_DATO_FOL Tipo dato SEXISTAT1 Sesso ETA1 Classe di età TITOLO_STUDIO Titolo di studio CITTADINANZA Cittadinanza TIME Seleziona periodo Value Flag Codes Flags
6452 ITG1 Sicilia EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2018 2018 61.598687 NaN NaN
6453 ITG1 Sicilia EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2019 2019 63.168572 NaN NaN
6454 ITG1 Sicilia EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2020 2020 63.525725 NaN NaN
6455 ITG1 Sicilia EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2021 2021 65.130276 NaN NaN
6456 ITG1 Sicilia EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 66.394632 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7052 ITD3 Veneto EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2018 2018 80.715351 NaN NaN
7053 ITD3 Veneto EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2019 2019 79.953319 NaN NaN
7054 ITD3 Veneto EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2020 2020 75.534940 NaN NaN
7055 ITD3 Veneto EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2021 2021 81.357479 NaN NaN
7056 ITD3 Veneto EMP_R tasso di occupazione 2 femmine Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 83.508957 NaN NaN

110 rows × 17 columns
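The three successive sub-frames above can equivalently be combined into a single boolean mask. A small sketch on a hand-made frame (the real CSV is not needed to show the idea; column names and values mirror the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Sesso": ["femmine", "maschi", "femmine"],
    "ETA1": ["Y20-64", "Y20-64", "Y15-24"],
    "Titolo di studio": ["laurea e post-laurea"] * 3,
    "Value": [70.1, 80.2, 25.3],
})

# One combined mask instead of three intermediate dataframes
mask = (
    (df["Sesso"] == "femmine")
    & (df["ETA1"] == "Y20-64")
    & (df["Titolo di studio"] == "laurea e post-laurea")
)
print(df[mask]["Value"].tolist())  # [70.1]
```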

In [12]:
# prepare data for choropleth map
employment_2022 = employment[employment['TIME'] == '2022']
employment_graduated_2022 = employment_2022[employment_2022['Titolo di studio'] == 'laurea e post-laurea']
employment_2022_20_64 = employment_graduated_2022[employment_graduated_2022['ETA1'] == 'Y20-64']

employment_2022_20_64.head()
Out[12]:
ITTER107 Territorio TIPO_DATO_FOL Tipo dato SEXISTAT1 Sesso ETA1 Classe di età TITOLO_STUDIO Titolo di studio CITTADINANZA Cittadinanza TIME Seleziona periodo Value Flag Codes Flags
6406 ITC2 Valle d'Aosta/Vallée d'Aoste EMP_R tasso di occupazione 9 totale Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 84.496008 NaN NaN
6411 ITD4 Friuli-Venezia Giulia EMP_R tasso di occupazione 9 totale Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 84.298419 NaN NaN
6416 ITDA Trentino-Alto Adige/Südtirol EMP_R tasso di occupazione 9 totale Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 85.271396 NaN NaN
6431 ITF4 Puglia EMP_R tasso di occupazione 9 totale Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 73.421410 NaN NaN
6436 ITF3 Campania EMP_R tasso di occupazione 9 totale Y20-64 20-64 anni 11 laurea e post-laurea TOTAL totale 2022 2022 71.129689 NaN NaN
In [13]:
# show regions
regions
Out[13]:
['Sicilia',
 'Calabria',
 'Toscana',
 'Friuli-Venezia Giulia',
 "Valle d'Aosta/Vallée d'Aoste",
 'Trentino-Alto Adige/Südtirol',
 'Puglia',
 'Campania',
 'Abruzzo',
 'Lazio',
 'Umbria',
 'Emilia-Romagna',
 'Liguria',
 'Lombardia',
 'Marche',
 'Basilicata',
 'Molise',
 'Piemonte',
 'Sardegna',
 'Veneto']
In [14]:
# imports
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
In [15]:
# bar plot: employment rate of graduated females (20-64) by region and year

fig, ax1 = plt.subplots(1, 1, figsize=(25, 10))

plt.suptitle('Employment rate of graduated females (20-64) in Italy', fontsize=25)

sns.barplot(
    ax = ax1,
    data = employment_graduated_female,
    x = 'Territorio',
    y = 'Value',
    hue = 'TIME'
)
ax1.set_xlabel("Region", fontsize=15)
ax1.set_ylabel("Employment rate", fontsize=15)
ax1.set_xticks(np.arange(0, len(regions), 1))
ax1.set_xticklabels(regions, rotation=90, fontsize=10)
ax1.grid(alpha=.4)

fig.tight_layout()
plt.show()
In [16]:
import json
with open('geojson/limits_IT_regions.geojson') as f:
    italy = json.load(f)

for feature in italy['features']:
    print(feature['properties']['reg_name'])
Piemonte
Valle d'Aosta/Vallée d'Aoste
Lombardia
Trentino-Alto Adige/Südtirol
Veneto
Friuli-Venezia Giulia
Liguria
Emilia-Romagna
Toscana
Umbria
Marche
Lazio
Abruzzo
Molise
Campania
Puglia
Basilicata
Calabria
Sicilia
Sardegna
In [17]:
# get min and max for 'Value'
min_value = employment_2022_20_64['Value'].min()
max_value = employment_2022_20_64['Value'].max()
In [18]:
fig2 = px.choropleth_mapbox(
    employment_2022_20_64,
    geojson=italy,
    locations='Territorio',
    featureidkey='properties.reg_name',
    color='Value',
    color_continuous_scale="Viridis",
    range_color=(min_value, max_value),
    labels={'Value':'Employment rate', 'Territorio':'Region'},
    title="Employment rate of graduates (20-64) in Italy (2022)",
    hover_data=['Territorio', 'Value'],
    center={"lat": 41.8719, "lon": 12.5674},
    mapbox_style="carto-positron",
    zoom=4
)

fig2.update_layout(margin={"r":0, "t":40, "l":0, "b":0})
fig2.show()

Exercise 5¶

In [19]:
# imports
import pandas as pd
In [20]:
# read csv
ionosphere = pd.read_csv('datasets/ionosphere.csv', header=None)

ionosphere
Out[20]:
0 1 2 3 4 5 6 7 8 9 ... 23 24 25 26 27 28 29 30 31 32
0 0.99539 -0.05889 0.85243 0.02306 0.83398 -0.37708 1.00000 0.03760 0.85243 -0.17755 ... -0.51171 0.41078 -0.46168 0.21266 -0.34090 0.42267 -0.54487 0.18641 -0.45300 g
1 1.00000 -0.18829 0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 ... -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447 b
2 1.00000 -0.03365 1.00000 0.00485 1.00000 -0.12062 0.88965 0.01198 0.73082 0.05346 ... -0.40220 0.58984 -0.22145 0.43100 -0.17365 0.60436 -0.24180 0.56045 -0.38238 g
3 1.00000 -0.45161 1.00000 1.00000 0.71216 -1.00000 0.00000 0.00000 0.00000 0.00000 ... 0.90695 0.51613 1.00000 1.00000 -0.20099 0.25682 1.00000 -0.32382 1.00000 b
4 1.00000 -0.02401 0.94140 0.06531 0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 ... -0.65158 0.13290 -0.53206 0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697 g
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
346 0.83508 0.08298 0.73739 -0.14706 0.84349 -0.05567 0.90441 -0.04622 0.89391 0.13130 ... -0.04202 0.83479 0.00123 1.00000 0.12815 0.86660 -0.10714 0.90546 -0.04307 g
347 0.95113 0.00419 0.95183 -0.02723 0.93438 -0.01920 0.94590 0.01606 0.96510 0.03281 ... 0.01361 0.93522 0.04925 0.93159 0.08168 0.94066 -0.00035 0.91483 0.04712 g
348 0.94701 -0.00034 0.93207 -0.03227 0.95177 -0.03431 0.95584 0.02446 0.94124 0.01766 ... 0.03193 0.92489 0.02542 0.92120 0.02242 0.92459 0.00442 0.92697 -0.00577 g
349 0.90608 -0.01657 0.98122 -0.01989 0.95691 -0.03646 0.85746 0.00110 0.89724 -0.03315 ... -0.02099 0.89147 -0.07760 0.82983 -0.17238 0.96022 -0.03757 0.87403 -0.16243 g
350 0.84710 0.13533 0.73638 -0.06151 0.87873 0.08260 0.88928 -0.09139 0.78735 0.06678 ... -0.15114 0.81147 -0.04822 0.78207 -0.00703 0.75747 -0.06678 0.85764 -0.06151 g

351 rows × 33 columns

In [21]:
# prepare fit data

fit_data = ionosphere.iloc[:, :-1]

fit_data.head()
Out[21]:
0 1 2 3 4 5 6 7 8 9 ... 22 23 24 25 26 27 28 29 30 31
0 0.99539 -0.05889 0.85243 0.02306 0.83398 -0.37708 1.00000 0.03760 0.85243 -0.17755 ... 0.56811 -0.51171 0.41078 -0.46168 0.21266 -0.34090 0.42267 -0.54487 0.18641 -0.45300
1 1.00000 -0.18829 0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 ... -0.20332 -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447
2 1.00000 -0.03365 1.00000 0.00485 1.00000 -0.12062 0.88965 0.01198 0.73082 0.05346 ... 0.57528 -0.40220 0.58984 -0.22145 0.43100 -0.17365 0.60436 -0.24180 0.56045 -0.38238
3 1.00000 -0.45161 1.00000 1.00000 0.71216 -1.00000 0.00000 0.00000 0.00000 0.00000 ... 1.00000 0.90695 0.51613 1.00000 1.00000 -0.20099 0.25682 1.00000 -0.32382 1.00000
4 1.00000 -0.02401 0.94140 0.06531 0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 ... 0.03286 -0.65158 0.13290 -0.53206 0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697

5 rows × 32 columns

In [22]:
from sklearn.decomposition import PCA

# apply pca to ionosphere with 2 components
pca = PCA(n_components=2)


pca_result = pca.fit_transform(fit_data)

ionosphere['pca-one'] = pca_result[:, 0]
ionosphere['pca-two'] = pca_result[:, 1]
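A quick sanity check after PCA is the explained variance ratio, i.e. how much of the dataset's variance the two components keep. Sketched here on synthetic data of the same shape (so the numbers are illustrative only, not the ionosphere values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(351, 32)                     # same shape as the ionosphere features

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_    # fraction of total variance per component
print(ratios.shape)                       # (2,), components sorted by variance
print(0.0 < ratios.sum() <= 1.0)          # True: the two fractions sum to at most 1
```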
In [23]:
# start to try different t-SNE: perplexity = 5

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=5, n_iter=300)

tsne_results = tsne.fit_transform(fit_data)

ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]

# view t-SNE with perplexity = 5

plt.figure(figsize=(16, 10))

plt.suptitle('t-SNE with perplexity = 5', fontsize=25)

sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=32,  # column 32 holds the class label ('g'/'b')
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
)

plt.show()
[t-SNE] Computing 16 nearest neighbors...
[t-SNE] Indexed 351 samples in 0.001s...
[t-SNE] Computed neighbors for 351 samples in 0.050s...
[t-SNE] Computed conditional probabilities for sample 351 / 351
[t-SNE] Mean sigma: 0.281231
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.277557
[t-SNE] KL divergence after 300 iterations: 1.039823
In [24]:
# t-SNE with perplexity 25

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=25, n_iter=300)

tsne_results = tsne.fit_transform(fit_data)

ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]

# view t-SNE with perplexity = 25

plt.figure(figsize=(16, 10))

plt.suptitle('t-SNE with perplexity = 25', fontsize=25)

sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=32,
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
)

plt.show()
[t-SNE] Computing 76 nearest neighbors...
[t-SNE] Indexed 351 samples in 0.000s...
[t-SNE] Computed neighbors for 351 samples in 0.007s...
[t-SNE] Computed conditional probabilities for sample 351 / 351
[t-SNE] Mean sigma: 0.588407
[t-SNE] KL divergence after 250 iterations with early exaggeration: 54.903748
[t-SNE] KL divergence after 300 iterations: 0.619860
In [25]:
# t-SNE with perplexity 40

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)

tsne_results = tsne.fit_transform(fit_data)

ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]

# view t-SNE with perplexity = 40

plt.figure(figsize=(16, 10))

plt.suptitle('t-SNE with perplexity = 40', fontsize=25)

sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=32,
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
)

plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 351 samples in 0.001s...
[t-SNE] Computed neighbors for 351 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 351 / 351
[t-SNE] Mean sigma: 0.747613
[t-SNE] KL divergence after 250 iterations with early exaggeration: 50.705322
[t-SNE] KL divergence after 300 iterations: 0.505443
In [26]:
# t-sne with perplexity = 25 and iteration 1000

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=25, n_iter=1000)

tsne_results = tsne.fit_transform(fit_data)

ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]

# view t-SNE with perplexity = 25

plt.figure(figsize=(16, 10))

plt.suptitle('t-SNE with perplexity = 25 and iteration 1000', fontsize=25)

sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=32,
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
)

plt.show()
[t-SNE] Computing 76 nearest neighbors...
[t-SNE] Indexed 351 samples in 0.000s...
[t-SNE] Computed neighbors for 351 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 351 / 351
[t-SNE] Mean sigma: 0.588407
[t-SNE] KL divergence after 250 iterations with early exaggeration: 54.303207
[t-SNE] KL divergence after 1000 iterations: 0.540016
In [27]:
# t-sne with perplexity = 50 and iteration 1000

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=1000)

tsne_results = tsne.fit_transform(fit_data)

ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]

# view t-SNE with perplexity = 50

plt.figure(figsize=(16, 10))

plt.suptitle('t-SNE with perplexity = 50 and iteration 1000', fontsize=25)

sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=32,
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
)

plt.show()
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 351 samples in 0.001s...
[t-SNE] Computed neighbors for 351 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 351 / 351
[t-SNE] Mean sigma: 0.850265
[t-SNE] KL divergence after 250 iterations with early exaggeration: 49.105682
[t-SNE] KL divergence after 1000 iterations: 0.380867
In [28]:
# plot PCA and t-SNE side by side

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(25, 15))

fig.suptitle('PCA vs. t-SNE', fontsize=25)

sns.scatterplot(
    x="pca-one", y="pca-two",
    hue=32,
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
    ax=ax1
)

sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=32,
    palette=sns.color_palette("hls", 2),
    data=ionosphere,
    legend="full",
    alpha=0.7,
    ax=ax2
)

fig.tight_layout()
plt.show()

From this graph, we can see that t-SNE performs better on our dataset. We would surely need more rigorous testing to state definitively which is better, and finding the optimal parameters for t-SNE is not easy. Apparently, a good compromise is a perplexity of 40 with a lower number of iterations, compared with the last runs that used a high number of iterations.
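Incidentally, the verbose logs above make the role of perplexity concrete: scikit-learn sizes the neighbourhood as 3 × perplexity + 1 nearest neighbours, which matches the "16 / 76 / 121 / 151 nearest neighbors" lines for the four perplexities tried:

```python
# Neighbourhood size used by scikit-learn's t-SNE for each perplexity tried above
for perplexity in (5, 25, 40, 50):
    n_neighbors = 3 * perplexity + 1
    print(f"perplexity={perplexity} -> {n_neighbors} nearest neighbours")
```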

Exercise 6¶

In [29]:
# imports

import pandas as pd
import plotly.express as px
In [30]:
# read csv

data = pd.read_csv('datasets/mydata.csv')

data['Variable'] = data['Variable'].astype(float)
In [31]:
# import geojson
import json
with open('geojson/usa.geo.json') as f:
    usa = json.load(f)

for feature in usa['features']:
    print(feature['properties']['NAME'])

len(usa['features'])
Maine
Massachusetts
Michigan
Montana
Nevada
New Jersey
New York
North Carolina
Ohio
Pennsylvania
Rhode Island
Tennessee
Texas
Utah
Washington
Wisconsin
Puerto Rico
Maryland
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Minnesota
Mississippi
Missouri
Nebraska
New Hampshire
New Mexico
North Dakota
Oklahoma
Oregon
South Carolina
South Dakota
Vermont
Virginia
West Virginia
Wyoming
Out[31]:
52
In [32]:
# get states from dataset
states = data['Name'].unique().tolist()

len(states)
Out[32]:
50
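Note the mismatch: the geojson has 52 features while the dataset covers 50 states. A set difference makes it easy to spot which geojson regions will stay uncoloured on the map; sketched here with small hand-made sets (judging from the printed lists, the two extras appear to be Puerto Rico and District of Columbia):

```python
# Sketch: compare geojson feature names against the states present in the data
geo_names = {"Maine", "Texas", "Puerto Rico", "District of Columbia"}
data_names = {"Maine", "Texas"}

unmatched = sorted(geo_names - data_names)
print(unmatched)  # ['District of Columbia', 'Puerto Rico']
```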
In [33]:
# get min and max for variable
min_var = data['Variable'].min()
max_var = data['Variable'].max()
In [34]:
fig = px.choropleth(data, 
                    geojson=usa, 
                    featureidkey='properties.NAME',
                    locations='Name', 
                    color='Variable',
                    color_continuous_scale="BuPu",
                    range_color=(min_var, max_var),
                    scope="usa",
                    labels={'Variable':'Variable'},
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.update_geos(
    lataxis_showgrid=True, 
    lonaxis_showgrid=True,
    bgcolor="lightgrey",
    landcolor="black")
fig.show()